CASE STUDY: Bellabeat Fitness Data Analysis

The case study follows the six step data analysis process: - ask - prepare - process - analyze - share - act

Scenario

Bellabeat is a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. I’ve been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights I discover will then help guide marketing strategy for the company.

ASK

Business Task: Identifying the trends of smart device usage and making data-driven business strategies to determine the opportunities for growth.

Questions for the analysis: 1. What are some trends in smart device usage? 2. How could these trends apply to Bellabeat customers? 3. How could these trends help influence Bellabeat marketing strategy?

Primary stakeholders: Urška Sršen and Sando Mur, executive team members. Secondary stakeholders: Bellabeat marketing analytics team.

PREPARE

  • The data for this analysis will come from FitBit Fitness Tracker Data on Kaggle.
  • This dataset generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016.
  • The dataset has in total 18 files in .csv format organized in long format.
  • Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID or timestamp. Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences. Limitations for this dataset exist due to the sample size, as the dataset was collected from only 30 users, and lack of key characteristics of the participants, such as age, gender, location, lifestyle. The analysis will be focused on daily usage of the Fitbit device, that’s why the following datasets will be used:
  • dailyActivity_merged
  • weightLogInfo_merged
  • sleepDay_merged

PROCESS

R Studio was used to complete this analysis because of the many packages and data visualization features available to explore the data with. ### Installing the packages

install.packages("plotly", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/svlas/OneDrive/Dokumente/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'plotly' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\svlas\AppData\Local\Temp\RtmpW09t0R\downloaded_packages
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/svlas/OneDrive/Dokumente/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\svlas\AppData\Local\Temp\RtmpW09t0R\downloaded_packages
install.packages("dplyr", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/svlas/OneDrive/Dokumente/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'dplyr' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'dplyr'
## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\svlas\OneDrive\Dokumente\R\win-
## library\4.1\00LOCK\dplyr\libs\x64\dplyr.dll to C:
## \Users\svlas\OneDrive\Dokumente\R\win-library\4.1\dplyr\libs\x64\dplyr.dll:
## Permission denied
## Warning: restored 'dplyr'
## 
## The downloaded binary packages are in
##  C:\Users\svlas\AppData\Local\Temp\RtmpW09t0R\downloaded_packages
install.packages("here", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/svlas/OneDrive/Dokumente/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'here' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\svlas\AppData\Local\Temp\RtmpW09t0R\downloaded_packages
install.packages("janitor", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/svlas/OneDrive/Dokumente/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'janitor' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\svlas\AppData\Local\Temp\RtmpW09t0R\downloaded_packages
install.packages("skimr", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/svlas/OneDrive/Dokumente/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'skimr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\svlas\AppData\Local\Temp\RtmpW09t0R\downloaded_packages
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/svlas/OneDrive/Dokumente/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\svlas\AppData\Local\Temp\RtmpW09t0R\downloaded_packages
install.packages("lubridate", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/svlas/OneDrive/Dokumente/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'lubridate' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'lubridate'
## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\svlas\OneDrive\Dokumente\R\win-
## library\4.1\00LOCK\lubridate\libs\x64\lubridate.dll
## to C:\Users\svlas\OneDrive\Dokumente\R\win-
## library\4.1\lubridate\libs\x64\lubridate.dll: Permission denied
## Warning: restored 'lubridate'
## 
## The downloaded binary packages are in
##  C:\Users\svlas\AppData\Local\Temp\RtmpW09t0R\downloaded_packages
install.packages("RColorBrewer", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/svlas/OneDrive/Dokumente/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'RColorBrewer' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\svlas\AppData\Local\Temp\RtmpW09t0R\downloaded_packages
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.6     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks plotly::filter(), stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(here)
## here() starts at C:/sources/R_project
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(skimr)
library(ggplot2)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(RColorBrewer)

Importing datasets

dailyActivity <- read.csv("C:\\sources\\R_project\\dailyActivity.csv")
View(dailyActivity)
sleepDay_merged <- read.csv("C:\\sources\\R_project\\sleepDay_merged.csv")
View(sleepDay_merged)
weightLogInfo_merged <- read.csv("C:\\sources\\R_project\\weightLogInfo_merged.csv")
View(weightLogInfo_merged)

Viewing the data frames

To be sure that the data frames were imported correctly, the head() function is used. The colnames() and glimpse() functions were used to determine what variables are included in the data frames and find commonalities

head(dailyActivity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
glimpse(dailyActivity)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ TotalSteps               <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes         <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories                 <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
colnames(dailyActivity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
head(sleepDay_merged)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320
glimpse(sleepDay_merged)
## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150~
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "~
## $ TotalSleepRecords  <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2~
## $ TotalTimeInBed     <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3~
colnames(sleepDay_merged)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
head(weightLogInfo_merged)
##           Id                  Date WeightKg WeightPounds Fat   BMI
## 1 1503960366  5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
## 2 1503960366  5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 3 1927972279  4/13/2016 1:08:52 AM    133.5     294.3171  NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.3249  NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     159.6147  25 27.45
##   IsManualReport        LogId
## 1           True 1.462234e+12
## 2           True 1.462320e+12
## 3          False 1.460510e+12
## 4           True 1.461283e+12
## 5           True 1.463098e+12
## 6           True 1.460938e+12
glimpse(weightLogInfo_merged)
## Rows: 67
## Columns: 8
## $ Id             <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212~
## $ Date           <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2~
## $ WeightKg       <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, ~
## $ WeightPounds   <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6~
## $ Fat            <int> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ BMI            <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,~
## $ IsManualReport <chr> "True", "True", "False", "True", "True", "True", "True"~
## $ LogId          <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,~
colnames(weightLogInfo_merged)
## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"

Converting data type from character to date time

As we can see, the data type of the Date column is character. We need to convert the data type to date time in all of three data frames.

dailyActivity$ActivityDateNew <- strptime(dailyActivity$ActivityDate, '%m/%d/%Y')
class(dailyActivity$ActivityDateNew)
## [1] "POSIXlt" "POSIXt"
sleepDay_merged$SleepDayNew <- strptime(sleepDay_merged$SleepDay, '%m/%d/%Y %I:%M:%S %p')
class(sleepDay_merged$SleepDayNew)
## [1] "POSIXlt" "POSIXt"
weightLogInfo_merged$DateNew <- strptime(weightLogInfo_merged$Date, '%m/%d/%Y %I:%M:%S %p')
class(weightLogInfo_merged$DateNew)
## [1] "POSIXlt" "POSIXt"

Adding a column for the day of the week

dailyActivity$day_of_week <- format(as.Date(dailyActivity$ActivityDateNew), "%A")
dailyActivity$day_of_week <- ordered(dailyActivity$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

ANALYZE

Summarizing the data

We use the n_distinct() function to determine the number of unique values.

n_distinct(dailyActivity$Id)
## [1] 33
n_distinct(sleepDay_merged$Id)
## [1] 24
n_distinct(weightLogInfo_merged$Id)
## [1] 8

As we can see, the dailyActivity database has 33 unique users, at the same time weightLogInfo database has only 8 unique users, and sleepDay database - 24. That means that reliable insights and recommendations cannot be made from these data frames.

To get key statistics about the data frames we use the summary() function.

dailyActivity%>%
  select(TotalSteps,TotalDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories)%>%
  summary()
##    TotalSteps    TotalDistance    VeryActiveMinutes FairlyActiveMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :  0.00    Min.   :  0.00     
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:  0.00    1st Qu.:  0.00     
##  Median : 7406   Median : 5.245   Median :  4.00    Median :  6.00     
##  Mean   : 7638   Mean   : 5.490   Mean   : 21.16    Mean   : 13.56     
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.: 32.00    3rd Qu.: 19.00     
##  Max.   :36019   Max.   :28.030   Max.   :210.00    Max.   :143.00     
##  LightlyActiveMinutes SedentaryMinutes    Calories   
##  Min.   :  0.0        Min.   :   0.0   Min.   :   0  
##  1st Qu.:127.0        1st Qu.: 729.8   1st Qu.:1828  
##  Median :199.0        Median :1057.5   Median :2134  
##  Mean   :192.8        Mean   : 991.2   Mean   :2304  
##  3rd Qu.:264.0        3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :518.0        Max.   :1440.0   Max.   :4900

As we can see, the average user takes 7638 steps per day, which is not enough because a reasonable target for healthy adults is 10 000 steps. According to recommendations of the World Health Organization Adults should do at least 75-150 minutes of intensity aerobic physical activity. The average Fitbit user gets 21.16 minutes of very active activity a day, which means 148.12 minutes a week. That means that the fitbit users make enough activity for substantial health benefits. However, the average amount of sedentary is 991.2 minutes, which equals 16.52 hours a day. Adults should start by doing small amounts of physical activity, and gradually increase the frequency, intensity and duration over time. Besides, the average calorie burn per day reaches 2304 calories. The average person burns around 1800 calories a day doing absolutely nothing. However, we can not make any conclusions, because we know nothing about the gender, or age of Fitbit users.

sleepDay_merged%>%
  select(TotalMinutesAsleep,TotalTimeInBed)%>%
  summary()
##  TotalMinutesAsleep TotalTimeInBed 
##  Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:361.0      1st Qu.:403.0  
##  Median :433.0      Median :463.0  
##  Mean   :419.5      Mean   :458.6  
##  3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :796.0      Max.   :961.0

According to the data, the average Fitbit user sleeps 419.5 minutes a day, which equals 6.99 hours a day. National Sleep Foundation guidelines advise that healthy adults need between 7 and 9 hours of sleep per night.

weightLogInfo_merged%>%
  select(WeightKg,BMI)%>%
  summary()
##     WeightKg           BMI       
##  Min.   : 52.60   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:23.96  
##  Median : 62.50   Median :24.39  
##  Mean   : 72.04   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :47.54

As we already know, this data frame has a low number of participants, the average BMI is 25.19. This is considered an overweight BMI. However, BMI can be a screening tool and does not diagnose the body fatness or health of an individual.

SHARE

To build some data visualizations for analysis I used the ggplot2 package. This allows us to show patterns and trends found in the data frames in a more visual way.

-1-

## `geom_smooth()` using formula 'y ~ x'

The first plot demonstrates the relationship between daily average active minutes (lightly + fairly + very active minutes) and daily average calories burned. The more vigorous physical activity the participant did, the more calories they burned.

-2-

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The second visualization shows that the more steps the Fitbit users took, the more calories they burned.

-3-

According to this bar graph we can see that the average Fitbit user takes maximum steps on Saturday and Sunday.

-4-

To create the next visualization we need to prepare our data. First, we calculate TotalMinutes that Fitbit users spend in all kinds of activities. Besides, we need to get a total amount of minutes for each kind of activity : sedentary, fairly active, lightly active and very active.

dailyActivity$TotalMinutes <- sum(dailyActivity$VeryActiveMinutes,dailyActivity$FairlyActiveMinutes,dailyActivity$LightlyActiveMinutes,dailyActivity$SedentaryMinutes)
dailyActivity$TotalActiveMinutes <- sum(dailyActivity$VeryActiveMinutes)
dailyActivity$TotalFairlyActiveMinutes <- sum(dailyActivity$FairlyActiveMinutes)
dailyActivity$TotalLightlyActiveMinutes <- sum(dailyActivity$LightlyActiveMinutes)
dailyActivity$TotalSedentaryMinutes <- sum(dailyActivity$SedentaryMinutes)

After this step, it becomes possible to calculate a percentage of each kind of activity.

dailyActivity$SedentaryPercentage <- dailyActivity$TotalSedentaryMinutes/dailyActivity$TotalMinutes*100
dailyActivity$ActivePercentage <- dailyActivity$TotalActiveMinutes/dailyActivity$TotalMinutes*100
dailyActivity$FairlyPercentage <- dailyActivity$TotalFairlyActiveMinutes/dailyActivity$TotalMinutes*100
dailyActivity$LightlyPercentage <- dailyActivity$TotalLightlyActiveMinutes/dailyActivity$TotalMinutes*100

To make a pie chart we need to create a new dataframe for percentage.

ActiveMinutes <- c("Sedentary", "Lightly", "Fairly", "Very Active")
Percent <- c(81.33, 15.82, 1.11, 1.73)
percentage_data <- data.frame(ActiveMinutes,Percent)
View(percentage_data)

Another way to create a pie chart:

## NULL

From the pie chart, we can see that most users spent 81.3% of their daily activity in sedentary minutes and only 1.73% in very active minutes.

-5-

To visualize a relationship between asleep minutes and burned calories, first we need to merge two data frames: dailyActivity and sleepDay_merged, using the merge function. These data frames are joined by one common key variable - Id.

daily_activity_merged <- merge(dailyActivity, sleepDay_merged, by = 'Id')
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ACT

Recommendations to business

First of all this data in this study only included the information from 33 people over the course of about a month. That means that there is not enough data to provide deep analysis and we need to do more research. Based on the analysis, we can determine that sedentaries make up a significant portion, 81% of users daily active minutes. Users spend on average 12 hours a day in sedentary minutes, 4 hours lightly active, and only half-hour in fairly+very active. To make their users more active, Bellabeat can include a function in the Bellabeat app to alert users who tend to have a high number to sedentary minutes. Also we see that on Saturday users take more steps, burn more calories, and spend less time sedentary. It would be useful if Bellabeat could provide users with some training programs or challenges to have short active exercises during the week.